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Abstract —This paper is concerned with the problem of low- 
rank plus sparse matrix decomposition for big data. Conventional 
algorithms for matrix decomposition use the entire data to extract 
the low-rank and sparse components, and are based on optimiza¬ 
tion problems that scale with the dimension of the data, which 
limit their scalability. Furthermore, the existing randomized 
approaches mostly rely on uniform random sampling, which can 
be quite inefficient for many real world data matrices that exhibit 
additional structures (e.g. clustering). In this paper, a scalable 
subspace-pursuit approach that transforms the decomposition 
problem to a subspace learning problem is proposed. The 
decomposition is carried out using a small data sketch formed 
from sampled columns/rows. Even when the data is sampled 
uniformly at random, it is shown that the sufficient number 
of sampled columns/rows is roughly C2(r/r), where // is the 
coherency parameter and r the rank of the low-rank component. 
In addition, efficient sampling algorithms are proposed to address 
the problem of column/row sampling from structured data. 
The proposed sampling algorithms can be independently used 
for feature selection from high-dimensional data. The proposed 
approach is amenable to online implementation and an online 
scheme is proposed. 

Index Terms —Low-Rank Matrix, Subspace Learning, Big 
Data, Matrix Decomposition, Column Sampling, Sketching 


I. Introduction 

UPPOSE we are given a data matrix D £ R NixN2 , which 
can be expressed as 

D = L + S, (1) 

where L is a low rank (LR) and S is a sparse matrix with 
arbitrary unknown support, whose entries can have arbitrarily 
large magnitude. Many important applications in which the 
data under study can be naturally modeled using 0 were dis¬ 
cussed in 0 - The cutting-edge Principal Component Pursuit 
approach developed in |T|. |2j, directly decomposes D into 
its LR and sparse components by solving the convex program 

min A||S||i + ||L||* 

LS . (2) 

subject to L + S = D 

where ||.||! is the G-norm, ||-||* is the nuclear norm and A de¬ 
termines the trade-off between the sparse and LR components 
0. The convex program 0 can precisely recover both the 
LR and sparse components if the columns and rows subspace 
of L are sufficiently incoherent with the standard basis and the 
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non-zero elements of S are sufficiently diffused {2:]. Although 
the problem in ([2]) is convex, its computational complexity 
is intolerable with large volumes of high-dimensional data. 
Even the efficient iterative algorithms proposed in 0,0 have 
prohibitive computational and memory requirements in high¬ 
dimensional settings. 

Contributions: This paper proposes a new randomized de¬ 
composition approach, which extracts the LR component in 
two consecutive steps. First, the column-space (CS) of L is 
learned from a small subset of the columns of the data matrix. 
Second, the row-space (RS) of L is obtained using a small 
subset of the rows of D. Unlike conventional decomposition 
that uses the entire data, we only utilize a small data sketch, 
and solve two low-dimensional optimization problems in lieu 
of one high-dimensional matrix decomposition problem ([2]) 
resulting in significant mnning time speed-ups. 

To the best of our knowledge, it is shown here for the 
first time that the sufficient number of randomly sampled 
columns/rows scales linearly with the rank r and the coherency 
parameter of L even with uniform random sampling. Also, 
in contrast to the existing randomized approaches 0-0. 
which use blind uniform random sampling, we propose a 
new methodology for efficient column/row sampling. When 
the columns/rows of L are not distributed uniformly in the 
CS/RS of L, which prevails much of the real world data, the 
proposed sampling approach is shown to achieve significant 
savings in data usage compared to uniform random sampling- 
based methods that require remarkable portions of the data. 
The proposed sampling algorithms can be independently used 
for feature selection from high-dimensional data. 

In the presented approach, once the CS is learned, each 
column is decomposed efficiently and independently using the 
proposed randomized vector decomposition method. Unlike 
most existing approaches, which are batch-based, this unique 
feature enables applicability to online settings. The presented 
vector decomposition method can be independently used in 
many applications as an efficient vector decomposition algo¬ 
rithm or for efficient linear decoding 18|— flT0[. 

A. Notation and definitions 

We use bold-face upper-case letters to denote matrices and 
bold-face lower-case letters to denote vectors. Given a matrix 
L, ||L|| denotes its spectral norm, ||L||f its Frobenius norm, 
and IILHoo the infinity norm, which is equal to the maximum 
absolute value of its elements. In an A - -dimensional space, e.j 
is the i th vector of the standard basis (i.e., the I th element of e, 
is equal to one and all the other elements are equal to zero). 
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The notation A = [ ] denotes an empty matrix and the matrix 
A = [Ai A 2 ... A„] 

is the column-wise concatenation of the matrices {A,;}" =1 . 
Random sampling refers to sampling without replacement. 


II. Background and Related Work 


A. Exact LR plus sparse matrix decomposition 

The incoherence of the CS and RS of L is an important re¬ 
quirement for the identifiability of the decompostion problem 
in 0 0. ©■ For the LR matrix L with rank r and compact 


S G 


and 


SVD L = U£V t (where U G 
V G R^ 2xr ), the incoherence condition is typically defined 
through the requirements 0, @ 

pr 

N -j 

(3) 
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and ||UV J 
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for some parameter p that bounds the projection of the 
standard basis {e,} onto the CS and RS. Other useful measures 
for the coherency of subspaces are given in G3 as, 

7(U) = v^Vi max|U(i, j)|,7(V) = v^V2max|V(i, j)|, ( 4 ) 
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where 7 (U) and 7 (V) bound the coherency of the CS and 
the RS, respectively. When some of the elements of the 
orthonormal basis of a subspace are too large, the subspace is 
coherent with the standard vectors. Actually, it is not hard to 
show that max( 7 (V), 7 (U)) < yfp. 

The decomposition of a data matrix into its LR and sparse 
components was analyzed in ||TJ, 0, and sufficient conditions 
for exact recovery using the convex minimization 0 were 
derived. In 0, the sparsity pattern of the sparse matrix is 
selected uniformly at random following the so-called Bernoulli 
model to ensure that the sparse matrix is not LR with over¬ 
whelming probability. In this model, which is also used in 
this paper, each element of the sparse matrix can be non¬ 
zero independently with a constant probability. Without loss 
of generality (w.l.o.g.), suppose that N 2 < N\. The following 
lemma states the main result of [TJ. 

Lemma 1 (Adapted from (fj). Suppose that the support set of 
S follows the Bernoulli model with parameter p. The convex 
program (j^J) with A = yields the exact decomposition 

with probability at least 1 — C\N\~ w provided that 

r < p r N 2 p~ l (log(TVi)) -2 , p< p s (5) 


where p s , C\ and p r are numerical constants. 

The optimization problem in 0 is convex and can be solved 
using standard techniques such as interior point methods 0. 
Although these methods have fast convergence rates, their 
usage is limited to small-size problems due to the high 
complexity of computing a step direction. Similar to the 
iterative shrinking algorithms for fi-norm and nuclear norm 
minimization, a family of iterative algorithms for solving the 
optimization problem 0 were proposed in |3), Q. However, 
they also require working with the entire data. For example. 


the algorithm in {4j requires computing the SVD of an NixN 2 
matrix in every iteration. 

B. Randomized approaches 

Owing to their inherent low-dimensional structures, the 
robust principal component analysis (PCA) and matrix de¬ 
composition problems can be conceivably solved using small 
data sketches, i.e., a small set of random observations of the 
data 0, 0, Q0-Q3J. In (12}, it was shown based on a 
simple degrees-of-freedom analysis that the LR and the sparse 
components can be precisely recovered using a small set of 
random linear measurements of D. A convex program was 


proposed in 121 to recover these components using random 


matrix embedding with a polylogarithmic penalty factor in 
sample complexity, albeit the formulation also requires solving 
a high-dimensional optimization problem. 

The iterative algorithms which solve 0 have complexity 
0(NiN 2 r) per iteration since they compute the partial SVD 
decomposition of N\ x N 2 dimensional matrices |4|. To 
reduce complexity, GoDec 00 uses a randomized method to 
efficiently compute the SVD, and the decomposition algorithm 
in G3 minimizes the rank of <FL instead of L, where $ is 
a random projection matrix. However, these approaches do 
not have provable performance guarantees and their memory 
requirements scale with the full data dimensions. Another 
limitation of the algorithm in 03 is its instability since 
different random projections may yield different results. 

The divide-and-conquer approach in |[5) (and a similar 
algorithm in {TFJ), can achieve super-linear speedups over full- 
scale matrix decomposition. This approach forms an estimate 
of L by combining two low-rank approximations obtained 
from submatrices formed from sampled rows and columns of 
D using the generalized Nystrom method [19}. Our approach 
also achieves super-linear speedups in decomposition, yet is 
fundamentally different from |5J and offers several advantages 
for the following reasons. First, our approach is a subspace- 
pursuit approach that focuses on subspace learning in a 
structure-preserving data sketch. Once the CS is learned, each 
column of the data is decomposed independently using a pro¬ 
posed randomized vector decomposition algorithm. Second, 
unlike [5 ], which is a batch approach that requires to store the 
entire data, the structure of the proposed approach naturally 
lends itself to online implementation (c.f. Section [I V-L} , which 
could be very beneficial for settings where the data comes in 
on the fly. Third, while the analysis provided in {5| requires 
roughly 0(r 2 p 2 max(A 1; N 2 )) random observations to ensure 
exact decomposition with high probability (whp), we show that 
the order of sufficient number of random observations depends 
linearly on the rank and the coherency parameter even if 
uniform random sampling is used. Fourth, the structure of the 
proposed approach enables us to leverage efficient sampling 
strategies for challenging and realistic scenarios in which the 
columns and rows of L are not uniformly distributed in their 
respective subspaces, or when the data exhibits additional 
structures (e.g. clustering structures) (c.f. Sections [IV-B|IV-C} . 
In such settings, the uniform random sampling used in |5} 
requires significantly larger amounts of data to carry out the 
decomposition. 









3 


III. Structure of the Proposed Approach and 
Theoretical Result 

In this section, the structure of the proposed randomized 
decomposition method is presented. A step-by-step analysis 
of the proposed approach is provided and sufficient conditions 
for exact decomposition are derived. Theorem [ 5 ] stating the 
main theoretical result of the paper is presented at the end of 
this section. The proofs of the lemmas and the theorem are 
deferred to the appendix. Let us rewrite 0 as 

D = UQ + S, (6) 

where Q = XV. The representation matrix Q € R rx is a 
full row rank matrix that contains the expansion of the columns 
of L in the orthonormal basis U. The first step of the proposed 
approach aims to learn the CS of L using a subset of the 
columns of D, and in the second step the representation matrix 
is obtained using a subset of the rows of D. 

Let U denote the CS of L. Fundamentally, U can be obtained 
from a small subset of the columns of L. However, since we 
do not have direct access to the LR matrix, a random subset 
of the columns of D is first selected. Hence, the matrix of 
sampled columns D sl can be written as D sl = DS |, where 
Si £ I^x™! is the column sampling matrix and m\ is the 
number of selected columns. The matrix of selected columns 
can be written as 


D s i — L s i T S sl , 


( 7 ) 


where L sl and S sl are its LR and sparse components, respec¬ 
tively. The idea is to decompose the sketch D s i into its LR and 
sparse components to learn the CS of L from the CS of L s i. 
Note that the columns of L s i are a subset of the columns of L 
since L s i = LSi. Should we be able to decompose D s i into 
its exact LR and sparse components (c.f. Lemma [Tji, we also 
need to ensure that the columns of L s i span U. The following 
lemma establishes that a small subset of the columns of D 
sampled uniformly at random contains sufficient information 
(i.e., the columns of the LR component of the sampled data 
span U) if the RS is incoherent. 


Lemma 2 . Suppose m\ columns are sampled uniformly at 
random from the matrix L with rank r. If 


mi > r7 2 (V)max 


^c 2 log r, c 3 log 



( 8 ) 


then the selected columns of the matrix L span the columns 
subspace ofLi with probability at least (1 — S) where c 2 and 
c 3 are numerical constants. 


Thus, if 7(V) is small (i.e., the RS is not coherent), a small 
set of randomly sampled columns can span U. According to 
Lemma[ 2 ] if mi satisfies (|8j, then L and L sl have the same CS 
whp. The following optimization problem (of dimensionality 
Nim\) is solved to decompose D s i into its LR and sparse 
components. 

min 
L s i,S sl 

subject to 


1 


IS.il 


VN~1 

L s i + Ssi = Dsi. 


( 9 ) 


Thus, the columns subspace of the LR matrix can be recovered 
by finding the columns subspace of L s i. Our next lemma 
establishes that 0 yields the exact decomposition using 
roughly mi = Oijir) randomly sampled columns. To simplify 
the analysis, in the following lemma it is assumed that the 
CS of the LR matrix is sampled from the random orthogonal 
model |20) , i.e., the columns of U are selected uniformly at 
random among all families of ? -orthonormal vectors. 

Lemma 3. Suppose the columns subspace of L is sampled 
from the random orthogonal model, L s i has the same column 
subspace of L and the support set of S follows the Bernoulli 
model with parameter p. In addition, assume that the columns 
of D s i were sampled uniformly at random. If 

mi > p (log Ni ) 2 and p < p s , (10) 

p r 

then 0 yields the exact decomposition with probability at 
least 1 — CgNf^, where 

' /C7 max(r, log Ni ) 2 2 \ 

p = max I---,67 (V),(c 9 7(V)logVi) I ( 11 ) 

and C7, C8 and C9 are constant numbers provided that Ni is 
greater than the RHS of first inequality of m 

Therefore, according to Lemma [ 3 ] and Lemma [ 2 ] the CS of 
L can be obtained using roughly 0 (rp ) uniformly sampled 
data columns. Note that mi <C Aq for high-dimensional data 
as mi scales linearly with r. Hence, the requirement that Ni 
is also greater than the RHS of first inequality of ( p~ 0 ] > is by 
no means restrictive and is naturally satisfied. 

Suppose that (| 9 ]l decomposes D s \ into its exact components 
and assume that U has been correctly identified. W.l.o.g., we 
can use U as an orthonormal basis for the learned CS. An 
arbitrary column d, of D can be written as d, = Uq, + s ; , 
where q, and s, are the corresponding columns of Q and S, 
respectively. Thus, d, — Uq; is a sparse vector. This suggests 
that q, can be learned using the minimization 


min|jd; - Uq; 


( 12 ) 


where the l \ -norm is used as a surrogate for the fo-norm to 
promote a sparse solution 0.GD- The optimization problem 
(121 is similar to a system of linear equations with r unknown 
variables and Ni equations. Since r 7C N -\, the idea is to 
learn q; using only a small subset of the equations. Thus, we 
propose the following vector decomposition program 


mm 

q. 


S^d; — S^Uq; 


( 13 ) 


where S 2 £ K Ar i xm 2 se i ec ts m 2 rows of U (and the corre¬ 
sponding m 2 elements of d;). 

First, we have to ensure that the rank of S 2 U is equal to 
the rank of U, for if q* is the optimal point of © then 
Uq* will be the LR component of d, . According to Lemma 
[2] m 2 = C>(r7(U)), is sufficient to preserve the rank of U 
when the rows are sampled uniformly at random. In addition, 
the following lemma establishes that if the rank of U is equal 


to the rank of S 2 U, then the sufficient value of to 2 for ( 131 
to yield the correct columns of Q whp is linear in r. 
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Lemma 4. Suppose that the rank of S] U is equal to the 
rank of L and assume that the CS of L is sampled from the 
random orthogonal model. The optimal point ofW is equal 
to q,; with probability at least (1 — 3(5) provided that 


P < 


0.5 


r/3(c 6 n log ^ + 1 ) 

( 2 r/ 3(/3 - 2) log | 


m2 > max 


V 3(/3 — l ) * 1 2 


, N\ 

CqK log — + 1 
0 


(14) 


C5(l0g-^-) 

where n = log ^ , c .5 and c$ are constant numbers and (S can 
be any real number greater than one. 

Therefore, we can obtain the LR component of each column 


using a random subset of its elements. Since (12 1 is an f^-norm 
minimization, we can write the representation matrix learning 
problem as 


mm | 
Q 


IS 2 D — S 2 UQ|| 


(15) 


Thus, (JT 5 J learns Q using a subset of the rows of D as S^D 
is the matrix formed from m 2 sampled rows of D. 

As such, we solve two low-dimensional subspace pursuit 
problems 0 and ( p~5] > of dimensions Ni mj and N 2 m 2 , 


respectively, instead of an NiN 2 -dimensional decomposition 
problem (|2|, and use a small random subset of the data to learn 
U and Q. The table of Algorithm 1 explains the structure of 
the proposed approach. 

We can readily state the following theorem which estab¬ 
lishes sufficient conditions for Algorithm 1 to yield exact 
decomposition. In this theorem, it is assumed that the columns 
and rows are sampled uniformly at random. In the next section, 
an efficient method for column and row sampling is presented. 
In addition, a scheme for online implementation is proposed. 

Theorem 5. Suppose the columns subspace of the LR matrix is 
sampled from the random orthogonal model and the support 
set of S follows the Bernoulli model with parameter p. In 
addition, it is assumed that Algorithm 1 samples the columns 
and rows uniformly at random. If for any small <5 > 0, 


mi > max ?" 7 2 (V) max C2 log r, C3 log - 


- ^'(loglVi) 2 

Pr / 


/ , , 3 

7772 > max I r log Ni max (c 2 log r, c 3 log -), 

2 r/?(/ 3 - 2 )log(^) 


3(/3-l) 2 


p < min p s , 


r/3 ( c 6 k log^ + 1) 


(16) 


N ± N 2 

Cg K log -h 1 


N\N 2 . 2 a/3 
c 5 (log—) 


0.5 


where 

' /c 7 max(r,logWi) 2 Ar^ 2 * * ^ 

p = max I --- ,67 (V), (c 9 7 (V) log Ni) I, 

log N ± 

k = -, 

r 

{ c i}?= i> c 2 c 3 are constant numbers and (3 is any 

real number greater that one, then the proposed approach 
(Algorithm 1) yields the exact decomposition with probability 
at least (1 — 55 — 3rNf 7 ) provided that N\ is greater than 
the RHS of first inequality o/©. 

Theorem 0 guarantees that the LR component can be 
obtained using a small subset of the data. The randomized ap¬ 
proach has two main advantages. First, it significantly reduces 
the memory/storage requirements since it only uses a small 
data sketch and solves two low-dimensional optimization prob¬ 
lems versus one large problem. Second, the proposed approach 
has C2(max(A r i ,N 2 )x max( 777 i, m 2 ) x r) per-iteration running 
time complexity, which is significantly lower than 0{N\N 2 r) 
per iteration for full scale decomposition 0 0-0 implying 
remarkable speedups for big data. For instance, consider U 
and Q sampled from A/”(0,1), r = 5, and S following the 
Bernoulli model with p = 0.02. For values of Ni = N 2 equal 
to 500, 1000, 5000, 10 4 and 2 x 10 4 , if mi = m 2 = 10r, 
the proposed approach yields the correct decomposition with 
90, 300, 680, 1520 and 4800 - fold speedup, respectively, over 
directly solving 0- 

Algorithm 1 Structure of Proposed Approach 

Input: Data matrix D £ l^ 1 x A '' 2 

1. Initialization: Form column sampling matrix Si £ jj7V 2 xm 1 anc | row 
sampling matrix S 2 £ R Jv i xm2 . 

2. CS Learning 

2.1 Column sampling: Matrix Si samples m\ columns of the given data 
matrix, D s i =DSi. 

2.2 CS learning: The convex program 0 is applied to the sampled columns 
Dsi- 

2.3 CS calculation: The CS is found as the columns subspace of the calculated 
LR component. 

3. Representation Matrix Learning 

3.1 Row sampling: Matrix S 2 samples m 2 rows of the given data matrix, 

D s2 = S^D. 

3.2 Representation matrix learning: The convex problem {0 is applied to 
the sampled rows to find the representation matrix. 

Output: If U is an orthonormal basis for the learned CS and Q is the obtained 
representation matrix, then L = UQ is the obtained LR component. 


IV. Efficient Column/Row Sampling 

In sharp contrast to randomized algorithms for matrix 
approximations rooted in numerical linear algebra (NLA) m- 
| 22 ]]. which seek to compute matrix approximations from sam¬ 
pled data using importance sampling, in matrix decomposition 
and robust PCA we do not have direct access to the LR matrix 
to measure how informative particular columns/rows are. As 
such, the existing randomized algorithms for matrix decompo¬ 
sition and robust PCA |5| |7j. |13) have predominantly relied 
upon uniform random sampling of columns/rows. 
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Fig. 1. Data distributions in a two-dimensional subspace. The red points are 
the normalized data points. 



Fig. 2. The rank of a set of uniformly random sampled columns for different 
number of clusters. 


In Section IV-A we briefly describe the implications of non¬ 


uniform data distribution and show that uniform random sam¬ 
pling may not be favorable for data matrices exhibiting some 
structures that prevail much of the real datasets. In Section 


IV-B we demonstrate an efficient column sampling strategy 
which will be integrated with the proposed decomposition 
method. The decomposition method with efficient column/row 


sampling is presented in Sections IV-C and IV-D 


A. Non-uniform data distribution 

When data points lie in a low-dimensional subspace, a small 
subset of the points can span the subspace. However, uniform 
random sampling is only effective when the data points are 
distributed uniformly in the subspace. To clarify. Fig. [T] shows 
two scenarios for a set of data points in a two-dimensional 
subspace. In the left plot, the data points are distributed 
uniformly at random. In this case, two randomly sampled data 
points can span the subspace whp. In the right plot, 95 percent 
of the data lie on a one-dimensional subspace, thus we may 
not be able to capture the two-dimensional subspace from a 
small random subset of the data points. 

In practice, the data points in a low-dimensional subspace 
may not be uniformly distributed, but rather exhibit some 
additional structures. A prevailing structure in many modern 
applications is clustered data 


124]. For example, user 
ratings for certain products (e.g. movies) in recommender sys¬ 
tems are not only LR due to their inherent correlations, but also 
exhibit additional clustering structures owing to the similarity 
of the preferences of individuals from similar backgrounds 
(e.g. education, culture, or gender) (23) , ||25j. 

To further show that uniform random sampling falls short 
when the data points are not distributed uniformly in the 


subspace, consider a matrix G £ ]gpooox6i50 g enera ted as 
G = [Gi G 2 ... G„] . For 1 < i < §, G* = U,Q, , where 
U. G R 2000 x^ q. g Rs x . For n/2 + l < i < n, 
G, = U,;Qi , where U* £ R 2000x s, Q i £ . The 

elements of U,; and Qi are sampled independently from a 
normal Nf( 0,1) distribution. The parameter r is set equal to 
60, thus, the rank of G is equal to 60 whp. Fig. [2] illustrates 
the rank of the randomly sampled columns versus the number 
of sampled columns for different number of clusters n. As n 
increases, so does the required number of uniformly sampled 
columns. When n = 60, it turns out that we need to sample 
more than half of the columns to span the CS. As such, 
we cannot evade high-dimensionality with uniform random 
column/row sampling. In | |23~| , it was shown that the RS 
coherency increases when the columns follow a more clustered 
distribution, and vice-versa. These observations match our the¬ 
oretical result in Lemma[2j which established that the sufficient 
number of randomly sampled columns depends linearly on the 
coherency parameter of the RS. 


B. Efficient column sampling method 

Column sampling is widely used for dimensionality re¬ 
duction and feature selection (B), (26), (27). In the column 
sampling problem, the LR matrix (or the matrix whose span 
is to be approximated with a small set of its columns) is 
available. Thus, the columns are sampled based on their 
importance, measured by the so-called leverage scores pT) , 
as opposed to blind uniform sampling. We refer the reader to 
Q3), (27) and references therein for more information about 
efficient column sampling methods. 

Next, we present a sampling approach which will be used 
in Section |IV-C| where the proposed decomposition algorithm 
with efficient sampling is presented. The proposed sampling 
strategy is inspired by the approach in (27) in the context 
of volume sampling. The table of Algorithm 2 details the 
presented column sampling procedure. Given a matrix A 
with rank ta, the algorithm aims to sample a small subset 
of the columns of A that span its CS. The first column is 
sampled uniformly at random or based on a judiciously chosen 
probability distribution 1211. The next columns are selected 
sequentially so as to maximize the novelty to the span of the 
selected columns. As shown in step 2.2 of Algorithm 2, a 
design threshold r is used to decide whether a given column 
brings sufficient novelty to the sampled columns by thresh¬ 
olding the f 2 -norm of its projection on the complement of 
the span of the sampled columns. The threshold r is naturally 
set to zero in a noise-free setting. Once the selected columns 
are believed to span the CS of A, they are removed from 
A. This procedure is repeated C times (using the remaining 
columns). In each time, the algorithm finds r columns which 
span the CS of A. After every iteration, the rank of the matrix 
of remaining columns is bounded above by r,\- As such, 
the algorithm samples approximately mi ~ Cta columns 
in total. In the proposed decomposition method with efficient 
column/row sampling (presented in Section IV-Ci, we set C 
large enough to ensure that the selected columns form a low 
rank matrix. 
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Algorithm 2 Efficient Sampling from LR Matrices 
Input: Matrix A. 

1. Initialize 

1.1 The parameter C is chosen as an integer greater than or equal to one. 
The algorithm finds C sets of linearly dependent columns. 

1.2 Set X = 0 as the index set of the sampled columns and set v = r, 
B = A and C = []. 

2. Repeat C Times 

2.1 Let b be a non-zero randomly sampled column from B with index i),. 
Update C and X as C = [C b], X = {X , i),}. 

2.2 While v > r 

2.2.1 Set E = P C B, where P c is the projection matrix onto the complement 
space of span(C). 

2.2.2 Define f as the column of E with the maximum I 2 -norm with index 
if. Update C, X and v as C = [C f] , X = {I, if} and v = 11 f 11 2 ■ 

2.2 End While 

2.3 Set C = [ ] and set B equal to A with the columns indexed by X set to 
zero. 

2. End Repeat 

Output: The set X contains the indices of the selected columns. 



Fig. 3. Visualization of the matrices defined in Section [IV-C Matrix D. UJ 
selected randomly or using Algorithm 3 described in Section IV D| 


is 


C. Proposed decomposition algorithm with efficient sampling 

In this section, we develop a modified decomposition algo¬ 
rithm that replaces uniform random sampling with the efficient 
column/row sampling method (Algorithm 2). In Section [VJ it 
is shown that the proposed technique can remarkably reduce 
the sampling requirement. We consider a setting wherein the 
data points (the columns of L) are not uniformly distributed, 
rather they admit an additional structure (such as clustering), 
wherefore a small subset of uniformly sampled columns is 
not likely to span the CS. However, we assume that the rows 
of L are distributed well enough, in the sense that they do 
not much align along any specific directions, such that C r r 
rows of L sampled uniformly at random span its RS whp, 
for some constant C r . In Section IV-D we dispense with this 
assumption. The proposed decomposition algorithm rests on 
three key ideas detailed next. 

1) Informative column sampling: The first important idea 
underlying the proposed sampling approach is to start sam¬ 
pling along the dimension that has the better distribution. 
For instance, in the example considered in Section IV-A the 
columns of G admit a clustering structure. However, the CS 
of G is a random r-dimensional subspace, which means that 
the rows of G are distributed uniformly at random in the RS 
of G. Thus, in this case we start with row sampling. The main 


intuition is that while almost 60 randomly sampled rows of G 
span the RS, a considerable portion of the columns (almost 
4000) should be sampled to capture the CS as shown in Fig. 

[2] As another example, consider an extreme scenario where 
only two columns of G £ Jgi 1000x1000 

are non-zero. In this 
case, with random sampling we need to sample almost all the 
columns to ensure that the sampled columns span the CS of 
G. But, if the non-zero columns are non-sparse, a small subset 
of randomly chosen rows of G will span its row space. 

Fet r denote a known upper bound on r. Such knowledge 
is often available as side information depending on the par¬ 
ticular application. For instance, facial images under varying 
illumination and facial expressions are known to lie on a 
special low-dimensional subspace |28) . For visualization. Fig. 

[3] provides a simplified illustration of the matrices defined in 
this section. We sample C r r rows of D uniformly at random. 
Fet Du, £ ^(Cr-r)xN 2 denote the matrix of sampled rows. 
We choose C r sufficiently large to ensure that the non-sparse 
component of D„, is a FR matrix. Define L,,,, assumably with 
rank r, as the FR component of D„,. If we locate a subset 
of the columns of L w that span its CS, the corresponding 
columns of L would span its CS. To this end, the convex 
program ([3]) is applied to D„ to extract its FR component 
denoted ti w . Then, Algorithm 2 is applied to L w to find a 
set of informative columns by sampling mi ~ Cr columns. 
In Remark [l] we discuss how to choose C in the algorithm. 
Define L®. as the matrix of columns selected from L,,,. The 
matrix D sl is formed using the columns of D corresponding 
to the sampled columns of L u ,. 

2) CS learning: Similar to the CS learning step of Algo¬ 
rithm 1, we can obtain the CS of L by decomposing D sl . 
However, we propose to leverage valuable information in 
the matrix L®, in decomposing D s i. In particular, if Dis 
decomposed correctly, the RS of L®, would be same as that 
of L s i given that the rank of Lu, is equal to r. Fet V s i be an 
orthonormal basis for the RS of L®,. Thus, in order to learn 
the CS of D sl , we only need to solve 

minUD^-UVjj!. (17) 

u 


Remark 1. Define as the i' h row of D sl . According to 
(17), U* (the i th row of U) is obtained as the optimal point 

of 


min||(d‘ 1 ) T -V al (U * i ) r || 1 


(18) 


u ; 


Based on the analysis provided for the proof of Lemma the 
optimal point of (18) is equal to U* if mi > r + 77 HS*!||o» 
where is the 1 ™ row of S s i, ||S* 2 ||o the number of non¬ 
zero elements of S* 1; and rj a real number which depends on 
the coherency of the subspace spanned by V s i. Thus, here C 
is determined based on the rank of L and the sparsity of S, 
i.e., Cr — r has to be sufficiently greater than the expected 
value for the number of non-zero elements of the rows of S s i. 


Remark 2. We note that the convex algorithm © may not 
always yield accurate decomposition of D m since structured 
data may not be sufficiently incoherent suggesting 

that the decomposition step can be further improved. Let 
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be the matrix consisting of the columns ofT) w corresponding 
to the columns selected from L w to form L®,. According to our 
investigations, an improved V sl can be obtained by applying 
the decomposition algorithm presented in / |29j f to D®, then use 
the RS 0 /I 4 as an initial guess for the RS of the non-sparse 
component of T) s w . Since D^, is low-dimensional (roughly 
0(r) x 0(r) dimensional matrix), this extra step is a low 
complexity operation. 

3) Representation matrix learning: Suppose that the CS of 
L was learned correctly, i.e., the span of the optimal point of 
0 is equal to the span of U. Thus, we use U as a basis for 
the learned CS. Now we leverage the information embedded 
in U to select the informative rows. Algorithm 2 is applied to 
U T to locate m 2 ~ Cr rows of U. Thus, we form the matrix 
D s2 from the rows of D corresponding to the selected rows 
of U. Thus, the representation matrix is learned as 

min||D s2 - U s2 Q||i , (19) 

Q 

where U s2 £ R m2Xr j s t jj e ma ti-i x 0 f the selected rows of U. 
Subsequently, the LR matrix can be obtained from the learned 
CS and the representation matrix. 


D. Column/Row sampling from sparsely corrupted data 

In Section IV-C[ we assumed that the LR component of 
D u , has rank r. However, if the rows are not well-distributed, 
a reasonably sized random subset of the rows may not span 
the RS of L. Here, we present a sampling approach which 
can find the informative columns/rows even when both the 
columns and the rows exhibit clustering structures such that a 
small random subset of the columns/rows of L cannot span its 
CS/RS. The algorithm presented in this section (Algorithm 3) 
can be independently used as an efficient sampling approach 
from big data. In this paper, we use Algorithm 3 to form D w 
if both the columns and rows exhibit clustering structures. 

The table of Algorithm 3, Fig. [4] and its caption provide the 
details of the proposed sampling approach and the definitions 
of the used matrices. We start the cycle from the position 
marked “I” in Fig. [4] with D w formed according to the 
initialization step of Algorithm 3. For ease of exposition, 
assume that L w = L w and L c = L c , i.e., D„, and D c 
are decomposed correctly. The matrix L®, is the informative 
columns of L w . Thus, the rank of L®, is equal to the rank 
of L w . Since h w = L w , L®, is a subset of the rows of L c . 
If the rows of L exhibit a clustering structure, it is likely 
that rank(L^) < rank(L c ). Thus, rank(L lu ) < rank(L c ). 
We continue one cycle of the algorithm by going through 
steps 1, 2 and 3 of Fig. [4] to update D„.. Using a similar 
argument, we see that the rank of an updated L w will be 
greater than the rank of L c . Thus, if we run more cycles 
of the algorithm - each time updating D, i; and D c - the 
rank of L„ ; and L 0 will increase. As detailed in the table 
of Algorithm 3, we stop if the dimension of the span of the 
obtained LR component does not change in T consecutive 
iterations. While there is no guarantee that the rank of L„, 
will converge to r (it can converge to a value smaller than 
r), our investigations have shown that Algorithm 3 performs 
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Fig. 4. Visualization of Algorithm 3. We run few cycles of the algorithm 
and stop when the rank of the LR component does not change over T 
consecutive steps. One cycle of the algorithm starts from the point marked 
“I” and proceeds as follows. I: Matrix is decomposed and is the 
obtained LR component of D^. II: Algorithm 2 is applied to lu w to select 
the informative columns of L^. LJ, is the matrix of columns selected from 
L^. Ill: Matrix D c is formed from the columns of D that correspond to 
the columns of L^. 1: Matrix D c is decomposed and L c is the obtained LR 
component of D c . 2: Algorithm 2 is applied to L ^ to select the informative 
rows of L c . L® is the matrix of rows selected from L c . 3: Matrix is 
formed as the rows of D corresponding to the rows used to form L£. 


quite well and the RS of L„, converges to the RS of L in few 
steps. We have also found that adding some randomly sampled 
columns (rows) to D,.(D„,) can effectively avert converging 
to a lower dimensional subspace. For instance, some randomly 
sampled columns can be added to D c , which was obtained by 
applying Algorithm 2 to L,„. 

Algorithm 3 Efficient Column/Row Sampling from Sparsely 
Corrupted LR Matrices 

1. Initialization 

Form Dm S MP TrxN2 by randomly choosing C r f rows of D. Initialize 
k = 1 and set T equal to an integer greater than 1. 

2. While k > 0 

2.1 Sample the most informative columns 

2.1.1 Obtain L w via JiJ as the LR component of D M; . 

2.1.2 Apply Algorithm 2 to L„ with C = C r . 

2.1.3 Form the matrix D c from the columns of D corresponding to the 
sampled columns of i w . 

2.2 Sample the most informative rows 

2.2.1 Obtain L c via { 2 ) as the LR component of D c . 

2.2.2 Apply Algorithm 2 to LjT with C = C r . 

2.2.3 Form the matrix D m from the rows of D corresponding to the sampled 
rows of L c . 

2.3 If the dimension of the RS of L u; does not increase in T consecutive 
iterations, set k = 0 to stop the algorithm. 

2. End While 

Output: The matrices and L„, can be used for column sampling in the 
first step of the Algorithm presented in Section |TV-C| 


Algorithm 3 was found to converge in a very small number 
of iterations (typically less than 4 iterations). Thus, even 
when Algorithm 3 is used to form the matrix D m , the order 
of complexity of the proposed decomposition method with 
efficient column/row sampling (presented in Section IV-Cl is 
roughly C>(max(^Vi, W 2 )r 2 ). 


E. Online Implementation 

The proposed decomposition approach consists of two main 
steps, namely, learning the CS of the LR component then 


































decomposing the columns independently. This structure lends 
itself to online implementation, which could be very beneficial 
in settings where the data arrives on the fly. The idea is to 
first learn the CS of the LR component from a small batch 
of the data and keep tracking the CS. Since the CS is being 
tracked, any new data column can be decomposed based on 
the updated subspace. The table of Algorithm 4 details the 
proposed online matrix decomposition algorithm, where d, 
denotes the /. lh received data column. 

Algorithm 4 uses a parameter n u which determines the rate 
at which the algorithm updates the CS of the LR component. 
For instance, if n u = 20, then the CS is updated every 20 
new data columns (step 2.2 of Algorithm 4). The parameter 
n u has to be set in accordance with the rate of the change 
of the subspace of the LR component; a small value for n u 
is used if the subspace is changing rapidly. The parameter n a 
determines the number of columns last received that are used 
to update the CS. If the subspace changes rapidly, the older 
columns may be less relevant to the current subspace, hence a 
small value for n s is used. On the other hand, when the data 
is noisy and the subspace changes at a slower rate, choosing 
a larger value for n s can lead to more accurate estimation of 
the CS. 


Algorithm 4 Online Implementation 

1. Initialization 

1.1 Set the parameters n u and n s equal to integers greater than or equal to 
one. 

1.2 Form D () e R Nl x ( c »-0 as 

Do = [di d2 ... d o T f]- 

Decompose Do using (2) and obtain the CS of its LR component. Define 
U D as the learned CS, Q 0 the appropriate representation matrix and S the 
obtained sparse component of Do- 

1.3 Apply Algorithm 2 to to construct the row sampling matrix S 2 . 

2. For any new data column dt do 

2.1 Decompose dt as 

minllSl'dt — Sj’Uoqt||i , (20) 

q t 

and update 

Qo [Qo qj], S «— [S (dt — U 0 q£)], where q£ is the optimal point of 

m- 

2.2 If the remainder of — is equal to zero, update U 0 as 

min||Dt - U0Q0II1 i ( 21 ) 

U 0 

where Q* is the last n s r columns of Q 0 and Dt is the matrix formed from 
the last n s r received data columns. Apply Algorithm 2 to the new to 
update the row sampling matrix S 2 . 

2. End For 

Output The matrix S as the obtained sparse matrix, L = D — S as the 
obtained LR matrix and U 0 as the current basis for the CS of the LR 
component. 


F Noisy Data 

In practice, noisy data can be modeled as 


( 22 ) 


where N is an additive noise component. In (|30jj, it was shown 
that the program 


min 

L.S 

subject to 


AllSHi + ||L||, 

||L + S - D|| f < e, 


(23) 


can recover the LR and sparse components with an error bound 
that is proportional to the noise level. The parameter e n has 
to be chosen based on the noise level. This modified version 
can be used in the proposed algorithms to account for the 
noise. Similarly, to account for the noise in the representation 
learning problem © the Li -norm minimization problem can 
be modified as follows; 


min||S^D — S^UQ — E||i subject to ||E||,f < S n . (24) 

Q,E 

E £ K m2XiV2 is used to cancel out the effect of the noise and 
the parameter 6 n is chosen based on the noise level (31). 


V. Numerical Simulations 

In this section, we present some numerical simulations to 
study the performance of the proposed randomized decomposi¬ 
tion method. First, we present a set of simulations confirming 
our analysis which established that the sufficient number of 
sampled columns/rows is linear in r. Then, we compare the 
proposed approach to the state-of-the-art randomized algo¬ 
rithm 0 and demonstrate that the proposed sampling strategy 
can lead to notable improvement in performance. We then 
provide an illustrative example to showcase the effectiveness 
of our approach on real video frames for background subtrac¬ 
tion and activity detection. Given the structure of the proposed 
approach, it is shown that side information can be leveraged 
to further simplify the decomposition task. In addition, a 
numerical example is provided to examine the performance 
of Algorithm 3. Finally, we investigate the performance of the 
online algorithm and show that the proposed online method 
can successfully track the underlying subspace. 

In all simulations, the Augmented Lagrange multiplier 
(ALM) algorithm Q, @ is used to solve the optimization 
problem |2]). In addition, the L -magic routine [ 32j is used to 
solve the fj-norm minimization problems. It is important to 
note that in all the provided simulations (except in Section 
V-D| >, the convex program (|2]) that operates on the entire data 
can yield correct decomposition with respect to the considered 
criteria. Thus, if the randomized methods cannot yield correct 
decomposition, it is because they fall short of acquiring the 
essential information through sampling. 


A. Phase transition plots 

In this section, we investigate the required number of ran¬ 
domly sampled columns/rows. The LR matrix is generated as 
a product L = U r Q r , where U r £ M AriXr and Q r £ W xN2 . 
The elements of U r and Q r are sampled independently from 
a standard normal Af(0, 1) distribution. The sparse matrix S 
follows the Bernoulli model with p = 0.02. In this experi¬ 
ment, Algorithm 1 is used and the column/rows are sampled 
uniformly at random. 


D=L+S+N, 
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r = 5 , p = 0.05 r = 15 , p = 0.05 r = 25,p = 0.05 



50 100 150 200 SO 100 150 200 50 100 150 200 


r = 5 , p = 0.05 r = 5 , p = 0.15 r = 5,p = 0.3 



Fig. 5. Phase transition plots for various rank and sparsity levels. White desig¬ 
nates successful decomposition and black designates incorrect decomposition. 


N 1 = N 2 = 1000 N n = N 2 = 2000 N.=N 2 =5000 



50 100 150 200 50 100 150 200 50 100 150 200 


Fig. 6. Phase transition plots for various data matrix dimensions. 


Fig. [5] shows the phase transition plots for different numbers 
of randomly sampled rows/columns. In this simulation, the 
data is a 1000 x 1000 matrix. For each (mi, m 2 ), we generate 
10 random realizations. A trial is considered successful if the 
recovered LR matrix L satisfies f < 5x 10 -3 . It is clear 
that the required number of sampled columns/rows increases 
as the rank or the sparsity parameter p are increased. When the 
sparsity parameter is increased to 0.3, the proposed algorithm 
can hardly yield correct decomposition. Actually, in this case 
the matrix S is no longer a sparse matrix. 

The top row of Fig. [5] confirms that the sufficient values for 
mi and m 2 are roughly linear in r. For instance, when the rank 
is increased from 5 to 25, the required value for mi increases 
from 30 to 140. In this experiment, the column and RS of 
L are sampled from the random orthogonal model. Thus, the 
CS and RS have small coherency whp (20 ]. Therefore, the 
important factor governing the sample complexity is the rank 
of L. Indeed, Fig. [6] shows the phase transition for different 
sizes of the data matrix when the rank of L is fixed. One 
can see that the required values for m\ and m 2 are almost 
independent of the size of the data confirming our analysis. 



Fig. 7. Performance of the proposed approach and the randomized algorithm 
in (5). A value 1 indicates correct decomposition and a value 0 indicates 
incorrect decomposition. 


For 1 < i < §, 

Gi , 

where Uj G K 2000x Q- g R7 x and the elements of U ?: 
and Q, are sampled independently from a normal distribution 

Af( 0,1). For n/2 + 1 < i < n, 

G i = 13U, Qi, 

where U, € R 2000x s, Q, g and the elements 

of U, and Q, are sampled independently from an A/"(0,1) 
distribution. We set r equal to 60; thus, the rank of L is equal 
to 60 whp. The sparse matrix S follows the Bernoulli model 
and each element of S is non-zero with probability 0.02. In 
this simulation, we do not use Algorithm 3 to form D„ . The 
matrix D„ is formed from 300 uniformly sampled rows of D. 

We evaluate the performance of the algorithm for different 
values of n, i.e., different number of clusters. Fig. [7] shows 
the performance of the proposed approach and the approach 
in (51 for different values of mi and m 2 . For each value 
of mi = m 2 , we compute the error in LR matrix recovery 
averaged over 10 independent runs, and conclude 
that the algorithm can yield correct decomposition if the 
average error is less than 0.01. In Fig. [7J the values 0, 1 
designate incorrect and correct decomposition, respectively. It 
can be seen that the presented approach requires a significantly 
smaller number of samples to yield the correct decomposition. 
This is due to the fact that the randomized algorithm |5J 
samples both the columns and rows uniformly at random 
and independently. In sharp contrast, we use L„. to find the 
most informative columns to form D s i, and also leverage the 
information embedded in the CS to find the informative rows 
to from D s2 . One can see that when n = 60, |5j cannot yield 
correct decomposition even when mi = m 2 = 1800. 


B. Efficient column/row sampling 

In this experiment, the algorithm presented in Section IV-C 
is compared to the randomized decomposition algorithm in (5J. 
It is shown that the proposed sampling strategy can effectively 
reduce the required number of sampled columns/rows, and 
makes the proposed method remarkably robust to structured 
data. In this experiment, D is a 2000 x 4200 matrix. The LR 
component is generated as 

L = [Gi G 2 ... G„] . 


C. Vector decomposition for background subtraction 

The LR plus sparse matrix decomposition can be effectively 
used to detect a moving object in a stationary background jl], 
1331. The background is modeled as a LR matrix and the mov¬ 


ing object as a sparse matrix. Since videos are typically high 
dimensional objects, standard algorithms can be quite slow for 
such applications. Our algorithm is a good candidate for such 
a problem as it reduces the dimensionality significantly. The 
decomposition problem can be further simplified by leveraging 
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Fig. 8. Stationary background. 



Fig. 9. Two frames of a video taken in a lobby. The first column displays 
the original frames. The second and third columns display the LR and sparse 
components recovered using the proposed approach. 


prior information about the stationary background. In partic¬ 
ular, we know that the background does not change or we 
can construct it with some pre-known dictionary. For example, 
consider the video from (34), which was also used in |[TJ . Few 
frames of the stationary background are illustrated in Fig. [8] 
Thus, we can simply form the CS of the LR matrix using 
these frames which can describe the stationary background 
in different states. Accordingly, we just need to learn the 
representation matrix. As such, the background subtraction 
problem is simplified to a vector decomposition problem. Fig. 
[9] shows that the proposed method successfully separates the 
background and the moving objects. In this experiment, 500 
randomly sampled rows are used (i.e., 500 randomly sampled 
pixels) for the representation matrix learning While the 
running time of our approach is just few milliseconds, it takes 
almost half an hour if we use (2| to decompose the video file 

0 - 


D. Alternating algorithm for column sampling 

In this section, we investigate the performance of Algorithm 
3 for column sampling. The rank of the selected columns is 
shown to converge to the rank of L even when both the rows 
and columns of L exhibit a highly structured distribution. To 
generate the LR matrix L we first generate a matrix G as 
in Section IV-A but setting r = 100. Then, we construct the 
matrix U g from the first r right singular vectors of G. We 
then generate G in a similar way and set V g equal to the first 
r right singular vectors of G. Let the matrix L = U g Vj. 
For example, for n = 100, L £ ]^i0250x 10250 ]sj ote 
resulting LR matrix is nearly sparse since in this simulation 
we consider a very challenging scenario in which both the 
columns and rows of L are highly structured and coherent. 
Thus, in this simulation we set the sparse matrix equal to zero 
and use Algorithm 3 as follows. The matrix D c is formed 
using 300 columns sampled uniformly at random and the 
following steps are performed iteratively: 

1. Apply Algorithm 2 to Dj with C = 3 to sample approxi¬ 



Fig. 10. The rank of the matrix of sampled columns. 


mately 3 r columns of D' r and form D„ from the rows of D 
corresponding to the selected rows of D c . 

2. Apply Algorithm 2 to with C = 3 to sample approxi¬ 
mately 3r columns of D„, and form D c from the columns 
of D corresponding to the selected columns of D c . Fig. 
[TOl shows the rank of D c after each iteration. It is evident 
that the algorithm converges to the rank of L in less than 3 
iterations even for n = 100 clusters. For all values of n, i.e., 
n £ {2,50,60}, the data is a 10250 x 10250 matrix. 


E. Online Implementation 

In this section, the proposed online method is examined. 
It is shown that the proposed scalable online algorithm tracks 
the underlying subspace successfully. The matrix S follows the 
Bernoulli model with p = 0.01. Assume that the orthonormal 
matrix U £ R ;V| xr spans a random r-dimensional subspace. 
The matrix L is generated as follows. 

For k from 1 to iV 2 

1. Generate E S I A,lXr and q 6 K rxl randomly. 

2. L = [L Uq] . 

3. If (mod(fc, n) = 0) 

U = approx-r(U + aE). 

End If 
End For 


The elements of q, and E are sampled from standard normal 
distributions. The output of the function approx-r is the matrix 
of the first r left singular vectors of the input matrix and 
mod(/.:, n) is the remainder of k/n. The parameters a and 
n control the rate of change of the underlying subspace. The 
subspace changes at a higher rate if a is increased or n is 
decreased. In this simulation, n = 10, i.e., the CS is randomly 
rotated every 10 new data columns. In this simulation, the 
parameter r = 5 and iVi = 400. We compare the performance 
of the proposed online approach to the online algorithm 
in (35) . For our proposed method, we set C = 20 when 
Algorithm 2 is applied to U, i.e., 20r rows of U are sampled. 
The method presented in (35) is initialized with the exact 
CS and its tuning parameter is set equal to 1 / y/N\. The 
algorithm (35) updates the CS with every new data column. 
The parameter n u of the proposed online method is set equal 
to 4 (i.e., the CS is updated every 4 new data columns) and 
the parameter n s is set equal to 5r. Define L as the recovered 
LR matrix. Fig. 


11 shows the C-norm of the columns of 










Proof of lemma [3] 

The sampled columns are written as 


11 



Fig. 11. Performance of the proposed online approach and the online 
algorithm in | |35| . 


L — L normalized by the average ( 2 -norm of the columns 
of L for different values of a. One can see that the proposed 
method can successfully track the CS while it is continuously 
changing. The online method (35) performs well when the 
subspace is not changing (a = 0), however, it fails to track 
the subspace when it is changing. 


Appendix 


Proof of Lemma [2] 

The selected columns of L can be written as L s i = LSi. 
Using the compact SVD of L, L sl can be rewritten as L sl = 
US V 7 Si. Therefore, to show that the CS of L sl is equal to 
that of L, it suffices to show that the matrix V 7 Si is a full 
rank matrix. The matrix Si selects mi rows of V uniformly 
at random. Therefore, using Theorem 2 in [11], if 


mi > r7 2 (V)max 


^c 2 log r, c 3 log 



(25) 


then the matrix V 7 Si satisfies the inequality 


D s i — DSi — L s i + S si . (29) 

First, we investigate the coherency of the new LR matrix L sl . 
Define P s r v as the projection matrix onto the CS of SjV 
which is equal to the rows subspace of L s i. Therefore, the 
projection of the standard basis onto the rows subspace of 
L s i can be written as 


max||P S T V e l || 2 = max||Sf V(V T S 1 SfV)- 1 V T Sie i || 2 

< max||SfV(V T SiS 7, V)- 1 V T e J || 2 

< ||SfV(V T SiSfV)- 1 || 2 ||V T e,|| 2 ‘- i() 

< 7 2 (V)r a\ _ 7 2 (V)r 6N 2 _ (67 2 (V))r 
— N 2 erf N 2 mi m\ 

where (SfV(V T SiSfV) _1 V T Si) is the projection matrix 
onto the CS of S^V. The first inequality follows from the 
fact that {S-| e,}^^ is a subset of {e,,-} 7 ^. The second 
inequality follows from Cauchy-Schwarz inequality and the 
third inequality follows from (|4]) and •113- 
Using lemma 2.2 of (20) , there exists numerical constants 
C7, cs such that 

max||U T ej||| < (31) 

* 1 \ i 


with probability at least 1 — c$Nf 3 and p, p = c 7 max (^ 1 °g jv i) ^ 
In addition, we need to find a bound similar to the third 
condition of ^ for the LR matrix L s i. Let L s i = U s iS s iV^i 
be the SVD decomposition of L s i. Define 


H = U s iVj;=^U: i (V: i ) T (32) 

i= 1 


'V T SiSfV|| < l (26) 

TOi 2 

with probability at least (1 — 6), where c 2 ,c3 are numerical 
constants 0 Accordingly, if cr\ and a r denote the largest 
and smallest singular values of Sf V, respectively, then 


(27) 


mi 9 , 3mi 

2N 2 - a i- a r< ^ 

Therefore, the singular values of the matrix V T S! are greater 
than JTjrj^. Accordingly, the matrix V T Si is a full rank 
matrix. 


Remark 3. A direct application of Theorem 2 in 0 would 
in fact lead to the sufficient condition 


nil > r7 2 (R)max 


^c 2 log r, c 3 log 



(28) 


where R £ R^x* 2 denotes the matrix of right singu¬ 
lar vectors of L. The bound in 0 is slightly tighter 
since it uses the incohe renc e parameter 7(V) < 7(R) = 
if N '2 max., -,- |R(i,j)| in (28), where V consists of the first r 
columns of R. This follows easily by replacing the incoherence 
parameter in the step that bounds the £ 2 -norm of the row 
vectors of the submatrix in the proof of ( 0, Theorem 2). 


where U* 1 is the z th column of U sl and V' sl is the z* column 
of V s i. Given the random orthogonal model of the CS, H has 
the same distribution as 

r 

H =Y,^h(Vh) T (33) 

i=l 


where {ej} is an independent Rademacher sequence. Using 
Hoeffding’s inequality (36) , conditioned on U,-| and V s i we 
have 


— t z 

P(|H'(»,j)| >t) <2e^, 

r 

hi = '£(y sl (i,k)) 2 (V sl (j,k)) 2 

k=1 


(34) 


Consider the following lemma adapted from Lemma 2.2 of 

( 2 §. 


Lemma 6 (Adapted from lemma 2.2 of (20)). If the or¬ 
thonormal matrix U follows the random orthogonal model, 
then 


P ^|U(z, j)| 2 > 20—^—^ <3 Nf 8 . 


(35) 
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Therefore, 


|U sl (i,fc)| 2 < 20 


logiVi 

N, 


(36) 


with probability at least 1 — 8 iV, . Therefore, we can bound 


h 2 ij as 


^<20^||V siei " 2 


“'*112 


Using (30), (37) can be rewritten as 

... . 19n 1UBJv 1 ' 2 


hi < 120 !5i^Wr 

Nimi 


(37) 


(38) 


Choose t = cu ~ ; V; vr for some constant u>. Thus, the uncon¬ 


ditional form of ([34]) can be written as 

p ( |H ' (i ' J)l> “Sr£) £2e 

o logiV 1 7 2 (V)r 

hi > 120 Ar V J 

Nimi 


log N ± _|_ 


*7 — 


(39) 


m i > - ^'(logiVi) 2 
P r 

P< Ps 


(41) 


(42) 


D s2 = S 2 D = S^L + S^S = L 


J s 2 


5 s 2 


(44) 


optimal point of (45 i is S * 2 with probability at least (1 — S) 
provided that 


m 2 


m 2 —r > max ( c 4 ||S I s 2 ||o 7 7 (U^ 2 ) log c 5 (log 


(46) 


for some fixed numerical constants c 4 and C 5 . The parameter 

7 (uy = v^ max l u s 2 (*)i)l and l|S* 2 ||o is the fo-norm 

i,o 

of S* 2 - In this paper, we do not assume that the sign of the 
non-zero elements of the sparse matrix S is random. However, 
according to Theorem 2.3 of |Tj (de-randomization technique) 
if the locations of the nonzero entries of S follow the Bernoulli 
model with parameter 2 p, and the signs of S are uniformly 


random and if (45) yields the exact solution who, then it is 


for some numerical constant £. Setting ui = Q logiVi where 
( is a sufficiently large numerical constant gives 

IH'lloo > Cglog< 3 rNr 7 (40) 

for come constant number eg since ( |36| ) should be satisfied for 
rNi random variables. 

Therefore, according to lemma [l] if 


also exact with at least the same probability for the model in 
which the signs are fixed and the locations follow the Bernoulli 
model with parameter p 0 . Therefore, it suffices to provide 
the sufficient condition for the exact recovery of a random sign 
sparse vector with Bernoulli parameter 2 p. 

First, we provide sufficient conditions to guarantee that 

m 2 -r> c 4 ||S l s 2 ||o 7 2 (U 2 2 )log^ (47) 

with high probability. Using Lemma [ 6 ] and the union bound, 

i2^on~°"- ( 48 ) 


max l u s 2 (t, j)! 2 < 20 lQg m2 

t,J m2 


-6 


with probability at least 1 — 3 m 2 

Now, we find the sufficient number of randomly sampled 


rows, m 2 , to guarantee that (47 1 is satisfied with high probabil- 

Therefore, 


ity. It is obvious that m 2 < N-y. Define k = 
it is sufficient to show that 


log jVl 


then, the convex algorithm (|9]) yields the exact decomposition 
with probability at least 1 — CgN p 3 where 

P = max (^ p , 67 2 (V),( 7 (V)c 9 logiVi) 2 ) . (43) 


Proof of lemma [4] 

Based on 0 . the matrix of sampled rows can be written as 


m 2 

sSk 


JVi 

> r ( cq k log — + 1 


whp, where eg = 20c 4 . Suppose that 

1 

P< 


fir (c 6 Klog^- + l) 


(49) 


(50) 


where [3 is a real number greater than one. Define a = 
r (cf;,K log 7 + l). According to (pOb and the Chernoff Bound 


Let L s2 = U s 2 S s 2 Vj 2 be the compact SVD decomposition 
of L s2 and L s2 = U^ 2 Sg 2 (Vg 2 ) T its complete SVD. It can 
be shown | 8 j that (13 1 is equivalent to 


for Binomial random variables (37J7 we have 

1 si 


m<2 , 

3 s 2 11 0 - > a ) < exp 


n(m 2 a\ 
*\a (3 3 ', 


(51) 


subject to (U^) z t = (U, 2 ) J S 


(45) 


s2 


where S * 2 is the i th column of S s2 and is the last (m 2 —r) 
columns of U ': 2 which are orthogonal to U s2 . In other words, 
if q* is the optimal point of CD and z* € W 712 is the optimal 
point of (45 1 , then z* = (d, — Uq*). Thus, it is enough to 

show that the optimal point of (45 1 is equal to S* 2 . 


If we set a = 7 (l ~ jj'j . then the inequality (49 1 is satisfied. 
Therefore, <du can be rewritten as 


< 2 exp 


m I, m 2 m 2 
Ss2l|o_ ^ > ^ 
-mj(i3 - l) 2 


1 

3 af3 


a 2 (3 2 2to 2 (/3 + 2) 


The columns subspace of U s2 obeys the random orthogonal 
model. Thus, U ^ 2 can be modeled as a random subset of U(j 2 . 
Based on the result in ©. if we assume that the sign of 
the non-zero elements of S ' 2 are uniformly random, then the 


Therefore, if 


2r/3(/3-2)log(i) ( N ± 

m2 - ——r ,slos T 


(52) 


( 53 ) 
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then the inequality ( |49| is satisfied with probability at least 
(1 — 5). Accordingly, if 


P < 


0.5 


r/3 (c 6 /tlog ^ + 1) 


ni 2 > max 


( 2rf3(f3 — 2) log 

l 3(>S - 1)=* 


Cq K log — + 1 


C5 


1 N ' 

log T 


(54) 


then ( |45] i returns the exact sparse vector with probability at 
least 1 — 35. The factor 0.5 in the numerator of the right 
hand side of the first inequality of ( |54| > is due to the de¬ 
randomization technique ||T]| to provide the guarantee for the 
fixed sign case. 

Proof of Theorem [5] 

The proposed decomposition algorithm yields the exact de¬ 
composition if: 

1. The sampled columns of the LR matrix span the the columns 
subspace of L. Lemma [2] provides the sufficient conditions on 
mi to guarantee that the columns of L.,-| span U with high 
probability. 

2. The program (|9]( yields the correct LR and sparse compo¬ 
nents of D s i. Lemma [3] provides the sufficient conditions on 
7711 and p to guarantee that D sl is decomposed correctly with 
high probability. 

3. The sampled rows of the LR matrix span the rows subspace 
of L. Since it is assumed that the CS of L is sampled from 
the random orthogonal model, according to Lemma [6] 


P (flT - 2 ° 1 °^ Vl ) < 

Therefore, according to Lemma @if 

m 2 > r log iVi max I c 2 log r, c 3 log 


3 riVf 7 . 



(55) 


(56) 


then the selected rows of the matrix L span the rows subspace 
of L with probability at least (1 — 5 — 3rN -f 7 ) where c 2 and 


c 3 are numerical constants. 


4. The minimization GD yields the correct RS. Lemma [4] 
provides the sufficient conditions to ensure that ( fl3| yields 
the correct representation vector. In order to guarantee the 
performance of (15 1 , we substitute 5 with S/N 2 since © has 
to return exact representation for the columns of D. Therefore, 


P (Incorrect Decomposition) < 5 + c^N 1 3 + 5 + 3rN 1 ‘ + 35. 
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